Context

The data we will be using throughout the practical classes comes from a small relational database whose schema can be seen below:

(image: database schema diagram)

Reading the Data

Metadata

Initial Analysis

Pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html

Pandas 10 min tutorial: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

Problems:

Take a closer look and point out possible problems:

(hint: a missing value in pandas is represented by a NaN value)
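As a minimal sketch of how the hint can be put to use, the snippet below counts NaN values per column with `DataFrame.isna()`. The DataFrame `df` and its column names are hypothetical toy data, not the course database.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only: the column names and values are
# hypothetical and do not come from the course database.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 28],
    "income": [42000.0, 38000.0, np.nan, np.nan],
    "education": ["Graduation", None, "PhD", "Master"],
})

# isna() flags every missing entry (NaN or None) as True;
# summing the booleans gives the number of missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)

# The mean of the booleans gives the share of missing values per column.
missing_share = df.isna().mean()
print(missing_share)
```

Applied to the real data, columns with a non-zero count are candidates for the problems list.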

Visual Exploration

Matplotlib tutorials: https://matplotlib.org/3.3.1/tutorials/index.html

Matplotlib gallery: https://matplotlib.org/3.3.1/tutorials/introductory/sample_plots.html#sphx-glr-tutorials-introductory-sample-plots-py

Seaborn tutorials: https://seaborn.pydata.org/tutorial.html

Seaborn gallery: https://seaborn.pydata.org/examples/index.html

Pyplot-style vs Object-Oriented-style
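A short sketch of the two Matplotlib interfaces, drawing the same line plot twice. The values in `x` and `y` are arbitrary illustration data.

```python
import matplotlib.pyplot as plt

# Arbitrary illustration data
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Pyplot-style: module-level functions act on an implicit
# "current" figure and axes.
plt.figure()
plt.plot(x, y)
plt.title("Pyplot-style")
plt.xlabel("x")
plt.ylabel("y")
pyplot_fig = plt.gcf()  # handle to the implicit current figure

# Object-Oriented-style: the figure and axes are created explicitly
# and the plot is customised by calling methods on them.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Object-Oriented-style")
ax.set_xlabel("x")
ax.set_ylabel("y")
```

The object-oriented form tends to scale better to figures with several subplots, since each axes object is addressed by name rather than through the implicit current axes.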

Numeric Variables' Univariate Distribution

What information can we extract from the plots above?

Insights:

Pairwise Relationship of Numerical Variables
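One possible way to look at pairwise relationships is a scatter-plot matrix. The sketch below uses `pandas.plotting.scatter_matrix` on randomly generated toy columns (`x1`, `x2`, `x3` are hypothetical names). Seaborn's `pairplot` is a common alternative.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Randomly generated toy data; the column names are hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "x1": rng.normal(size=100),
    "x2": rng.normal(size=100),
    "x3": rng.normal(size=100),
})

# One scatter plot per pair of columns, with a histogram of each
# variable on the diagonal. Returns a 2D array of axes.
axes = scatter_matrix(df, figsize=(6, 6), diagonal="hist")
```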

Insights:

Categorical/Low Cardinality Variables' Absolute Frequencies

What information can we extract from the plot above?

Using the same logic as in the multiple box plot figure above, build a figure with multiple bar plots, one for each non-metric variable:
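A possible sketch of this exercise, assuming the box plot figure was built by looping over a grid of subplots. The DataFrame `df`, the list `non_metric_features`, and the column names are hypothetical placeholders for the real data.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd

# Toy data for illustration only; column names and values are hypothetical.
df = pd.DataFrame({
    "education": ["Graduation", "PhD", "Master", "Graduation", "PhD", "Graduation"],
    "status": ["Married", "Single", "Married", "Married", "Divorced", "Single"],
    "dependents": ["0", "1", "1", "0", "1", "1"],
})
non_metric_features = ["education", "status", "dependents"]

# Lay the plots out on a grid with a fixed number of columns.
n_cols = 2
n_rows = math.ceil(len(non_metric_features) / n_cols)
# squeeze=False keeps `axes` a 2D array even for a single row or column.
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 4 * n_rows), squeeze=False)

# One bar plot of absolute frequencies per non-metric variable.
for ax, feature in zip(axes.flatten(), non_metric_features):
    counts = df[feature].value_counts()
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(feature)
    ax.set_ylabel("Absolute frequency")

# Hide any grid cells left over when the features do not fill the grid.
for ax in axes.flatten()[len(non_metric_features):]:
    ax.set_visible(False)

fig.tight_layout()
```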

Insights:

Metric Variables' Correlation Matrix
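A minimal sketch of one way to compute and draw a correlation matrix, using `DataFrame.corr()` and a plain Matplotlib heatmap. The columns `a`, `b` and `c` are randomly generated toy data, and Pearson correlation is an assumption made here. Seaborn's `heatmap` would be a common alternative for the drawing step.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated toy data; "b" is built to correlate with "a".
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({
    "a": x,
    "b": 2 * x + rng.normal(scale=0.5, size=200),
    "c": rng.normal(size=200),
})

# Pairwise Pearson correlation between the metric columns.
corr = df.corr(method="pearson")

# Draw the matrix as a heatmap on a fixed [-1, 1] colour scale.
fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax, label="Pearson correlation")

# Annotate each cell with its coefficient.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
```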

Coherence Check

Outliers

Missing Values

Feature Engineering